A different view of the logistic regression

We can think of a logistic regression as model in which individuals have a latent trait that determines, given a threshold, if they fall in one category or another. For instance, latent scholastic aptitude is a trait that (co-) determines if a student passes a class:

threshold = -2
set_par(cex = 1.5)
curve(dlogis(x),-6,6, xlim = c(-5,5),
      xlab = "latent scholastic aptitude")
x = seq(threshold,5,.1)
shade(dlogis(x)~x,c(threshold,6), col = adjustcolor("green3",.5))
x = seq(-6,threshold,.1)
shade(dlogis(x)~x,c(-6,threshold), col = adjustcolor("red3",.5))
abline(v = threshold, lty = 2, lwd = 2)

In this model, the latent variable is distributed according to the logistic distribution.

We can calculate the log odds of an outcome as before, this time using probabilities.

The odds of an event represent the likelihood that the event will occur compared to the likelihood that it will not. Odds are usually expressed as ratios.

\[ \textrm{log odds} = log \left( \frac{P_{fail}}{P_{pass}} \right) = log \left( \frac{P_{fail}}{1-P_{fail}} \right) \] To get the probability to fail at a certain threshold, we use the cumulative probability function of the logistic distribution, also called the inverse logit.

In R we can use boot::inv.logit or plogis to get the cumulative density function of the logistic distribution.

set_par(cex = 1.5)
curve(plogis(x),-6,6,xlim = c(-5,5),
      xlab = "latent scholastic aptitude",
      ylab = "plogis(x) = inv_logit(x)")
curve(plogis(x),-6,-2,col = "red", add = T, lwd = 1.5)
curve(plogis(x),-2,6,col = "green3", add = T, lwd = 1.5)
lines(c(-6,threshold),rep(plogis(threshold),2), lty = 2, col = "red")
lines(rep(threshold,2),c(0,plogis(threshold)), lty = 2, col = "red")
abline(h = c(0,1), lty = 2)

P_fail = plogis(threshold)      # fail prob. from thresh. (-2)
P_pass = 1-P_fail               # Pass probability
log_odds = log(P_fail/P_pass)   # calculate log odds
log_odds                        # log odds = threshold!
## [1] -2

This is exactly our threshold and explains why the intercept coefficient of an intercept-only logistic regression for data with a fail-rate of 12% will be qlogis(.12) = -2.

The following figure shows that we can use the cumulative distribution function (plogis or inv.logit) to go from a threshold value to success probabilities and the quantile function (qlogis or logit) to go from success probabilities to thresholds.

set_par(mfrow = c(1,2), cex = 1.5)
curve(plogis(x),-5,5, xlab = "threshold", ylab = "plogis(x) or inv.logit(x)")
arrows(x0 = threshold, y0 = 0, y1 = plogis(threshold), col = "blue", lwd = 2, length = .2)
arrows(x0 = threshold, x1 = -5.4, y0 = plogis(threshold), col = "blue", lwd = 2, length = .2)
text(-5.4,plogis(threshold),labels = round(plogis(threshold),2), pos = 2, col = "blue", xpd = TRUE, cex = .75)
abline(h = c(0,1), lty = 2)
curve(qlogis(x),0,1, xlab = "cumulative probability", ylab = "qlogis(x) or logit(x)")
arrows(x0 = plogis(threshold), y0 = -5, y1 = threshold, col = "blue", lwd = 2, length = .2)
arrows(y0 = threshold, x0 = plogis(threshold), x1 = -0.04, col = "blue", lwd = 2, length = .2)
text(-5.4,plogis(threshold),labels = round(plogis(threshold),1), pos = 2, xpd = TRUE)
text(plogis(threshold),-5.4,labels = round(plogis(threshold),2), pos = 1, col = "blue", xpd = TRUE, cex = .75)
abline(v = c(0,1), lty = 2)


Now let us assume that there is another variable \(\small X\) that increases the probability to pass the class, so that also students with lower latent scholastic aptitude of -3 can pass a class (could be class difficulty).

In a logistic regression we estimate the effect of \(\small X\) as follows:

\[ log \left( \frac{P_{fail}}{P_{pass}} \right) = \alpha + \beta \cdot X \]

Here, \(\small \alpha\) is the intercept and baseline log-odds to pass the test (due to latent scholastic aptitude) and \(\small \beta\) captures the change in log odds of passing the class due to \(\small X\). Because the log odds can also be understood as thresholds on top of a logistic-distributed latent variable, you can also think of \(\small \alpha\) and \(\small \beta\) together determining the threshold value for passing and the thus the probability to pass.

Because our baseline threshold / log odds is -2 and the threshold for children with variable X is -3, we calculate beta as the difference of the two thresholds:

\[ \beta = -3-(-2) = -1 \]

So if we simulate data according to this data generating process and estimate a logistic regression, we should find that \(\small \alpha = -2\) and \(\small \beta = -1\).

Lets simulate the data:

set.seed(1)
N = 10000
thresholds = rep(-2,N)
X = sample(c(0,1), N, replace = TRUE)
thresholds[X==1] = -3
trait_values = rlogis(N)
pass = trait_values > thresholds

We estimate the model:

q.fit = quap(
  alist(
    pass ~ dbinom(1,p),
    logit(p) ~ -(a + b*X),
    a ~ dnorm(0,3),
    b ~ dnorm(0,3)
  ),
  data = list(pass = pass, X = X)
)
precis(q.fit) %>% 
  round(2)
##    mean   sd  5.5% 94.5%
## a -1.96 0.04 -2.03 -1.89
## b -1.07 0.08 -1.19 -0.94

Due to simulation noise when generating trait_values we are not exactly recovering the parameters, but we are getting very close to the correct thresholds with -1.96 at baseline and a + b = -1.96 + -1.07 = -3.03 for individuals where X = 1.

This shows us again that regression weights in logistic regressions represent changes in log odds ratios, and thus also changes in the threshold or success probability, due to a predictor variable. As we discussed in the last class, these changes are due to the exp in the link function multiplicative on the probability scale.

Cumulative ordinal logistic regression

A simple threshold / intercepts only model

The ordinal logistic regression is named as such, because just like logistic regression, it assumes a latent logistic variable that determines responses. Differently than the standard logistic regression, ordered logistic regression uses not one but multiple thresholds in order to map a latent trait (variable) onto ordered responses.

Lets visualize this with 3 thresholds of our choice for \(\small M=4\) categories.

thresholds = c(-1.24,.25,2)
thresholds
## [1] -1.24  0.25  2.00
plot_dlogis.thresh = function(threshs, xlim = c(-5,5), plot.text = TRUE) {
  threshs = as.numeric(threshs)
  x.st = xlim[1]-1
  x.en = xlim[2]+1
  curve(dlogis(x),x.st,x.en,xlim = xlim,
        xlab = "latent scholastic aptitude")
  n_cat = length(threshs) + 1
  clrs =
    colorRampPalette(c("blue","blue4"))(n_cat)
  for (k in 1:n_cat) {
    st = ifelse(k == 1, x.st, threshs[k-1])
    en = ifelse(k == n_cat, x.en, threshs[k])
    x = seq(st,en,length.out = 50)
    polygon(c(min(x),x,max(x)), c(0,dlogis(x),0), col = clrs[k])
    arrows(x0 =  threshs[k], y0 = 0, y1 = dlogis(0), lty = 2, col = "red", lwd = 2,length = 0)
    if (plot.text == TRUE) {
      text((st+en)/2,dlogis(0)/2,paste0("Y=",k), col = "grey50", cex = 1.5)
      text(threshs[k],dlogis(0),round(threshs[k],2), col = "red", pos = 2)
    }
  }
}
n_cat = length(thresholds) + 1
clrs = colorRampPalette(c("blue","blue4"))(n_cat)
set_par(cex = 1.5)
plot_dlogis.thresh(thresholds)

Just as we can use the (cumulative) logistic distribution to recover threshold values (log odds) for the simple logistic regression, we can use it to recover thresholds for an ordered logistic model.

For instance, when 22.4% of individuals respond with Y = 1:

threshold_1 = log(.224/(1-.224))  # note the log odds
threshold_1 %>% round(2)
## [1] -1.24

As you might have guessed, we can also read the cumulative probabilities to give a response up to a level \(\small k\) given \(\small k-1\) thresholds from the cumulative logistic distribution.

Then, we obtain a categorical distribution by calculating the probability of a response as the difference of the cumulative probabilities at the thresholds that enclose that category. This categorical distribution that is used to calculate the likelihood for ordinal responses.

plot_plogis.thresh = function(thresholds, plot.thrsh.eq = FALSE, xlim = c(-5,5)) {
  x.st = xlim[1]-1
  x.en = xlim[2]+1
  n_cat = length(thresholds) + 1
  clrs = 
    colorRampPalette(c("blue","blue4"))(n_cat) 
  curve(plogis(x),x.st,x.en,xlim = c(-5,5),
        xlab = "latent scholastic aptitude")
  for (k in 1:n_cat) {
    s = ifelse(k == 1, x.st, thresholds[k-1])
    e = ifelse(k == n_cat, x.en, thresholds[k])
    curve(plogis(x), s,e, col = clrs[k], add = T, lwd = 2)
    
    b = ifelse(k == 1, 0, plogis(thresholds[k-1]))
    t = ifelse(k == n_cat, 1, plogis(thresholds[k]))
    x = seq(s,e,.01)
    y.p = c(plogis(x), plogis(max(x)))
    x.p = c(x,x.en)
    polygon(c(x.p,x.en),c(y.p,plogis(x[1])), col = clrs[k], border = "NA")
    arrows(x0 = -4, y1 = t, y0 = b, angle = 90, code = 3, length = .15)
    
    prob.level = 
      ifelse(
        k == 1,
        plogis(thresholds[k]),
        ifelse(
          k == n_cat,
          1-plogis(thresholds[k-1]),
          plogis(thresholds[k])-plogis(thresholds[k-1]))) %>% 
      round(2)
    text(-4, (b+t)/2, paste0("P(Y=",k,")=", prob.level), pos = 4, cex = .8)
    
    if (k < n_cat) {
      p=plogis(thresholds[k])
      thrsh = log(p/(1-p)) %>% round(2)
      thrsh.equ = bquote(frac(.(round(p,2)),.(round(1-p,2)))~"=exp("~.(thrsh)~")")
      if (plot.thrsh.eq == TRUE) text(5.5,p,thrsh.equ, xpd = T, pos = 4, cex = .8)
      arrows(x0 = thresholds[k], y0 = 0, y1 = p, col = "red", lty = 2)
      arrows(x0 = thresholds[k], x1 = -5.4, y0 = p, col = "red", lty = 2)
    }
  }
  points(thresholds,rep(0,n_cat-1), pch = "|", col = "red")
}
set_par(cex = 1.5, mar = c(3,3,.5,7))
plot_plogis.thresh(thresholds, plot.thrsh.eq = TRUE)

Let’s again simulate data from the model and estimate the thresholds.

N = 1000
Y = cut(
  rlogis(1000),                          # latent variable
  breaks = c(-Inf,thresholds,Inf)) %>%   # thresholds
  as.numeric()

which we can show as histogram and as cumulative counts:

set_par(mfrow = c(1,2))
Y %>% 
  table() %>% 
  barplot(ylab = "N",
          col = clrs,
          xlab = "Response")

m = diag(4)
m[upper.tri(m)] = 1
m = apply(m,2, function(x) x*Y %>% table())
x = barplot(m, col = clrs,
        ylab = "cumulative N",
        xlab = "Response",
        names.arg = 1:n_cat)
ps = c(plogis(thresholds[1]),diff(c(plogis(thresholds),1)))
cs = c(0,plogis(thresholds),1)
for (k in 1:n_cat)
  text(tail(x,1),
       mean(cs[k:(k+1)])*N,
       paste(round(ps[k]*100),"%"),
       col ="white",
       cex = .5)

Here is an ordered logistic regression model:

bol.model = 
  alist(
    Y ~ dordlogit(0,thresholds),
    thresholds ~ dnorm(0,3)
  )

The key difference between the simple logistic and the cumulative ordered logistic regression with responses \(\small Y_i \in \{0,1\}\) is that the former has only one threshold parameter:

\[ log \left( \frac{P(Y_i = 1)}{P(Y_i = 0)} \right) = \alpha \]

and the latter has, given responses \(\small Y_i \in \{1, 2, ..., M\}\) in ordered categories, \(\small M-1\) threshold parameters:

\[ log \left( \frac{P(Y_i>k)}{P(Y_i<k)} \right) = \alpha_k \]

with responses \(\small k \in \{1,2...M\}\)

The dordlogit function in the rethinking package calculates the likelihood of the observations given the model by using the plogis / inv_logit link functions and threshold values to calculate likelihoods from a categorical distribution.. The implicit latent variable remains unchanged, but thresholds are adjusted to maximize the posterior probability of the data.

set_par(mfrow = c(2,2))

thresh_list = list(thresholds,thresholds-1.25)

for (k in 1:2) {
  thres = thresh_list[[k]]
  plot_plogis.thresh(thres)
  rbind(Y %>% 
          table %>% 
          prop.table(),
        rlogis(100) %>% 
          cut(c(-Inf,thres,+Inf)) %>% 
          table %>% prop.table()) %>% 
    barplot(beside = T, 
            col = c("grey50","blue"),
            ylab="proportion",
            xlab = "Response")
  legend("topleft", fill = c("grey50","blue"), 
         legend = c("data",expression("model | "~theta)), bty = "n")
  
}

Moving the thresholds changes the probability / likelihood to observe the different categories.

The basic ordered logistic regression model without predictors finds the threshold values that allow to reproduce the observed distribution of ratings.

We fit the model.

bol.fit = ulam(
  bol.model,
  data=list(Y=Y),
  chains=4,cores=4)

And here are the estimated thresholds, with the true thresholds as vertical, dotted, blue lines.

precis(bol.fit, depth = 2) %>% plot()
threshold.est = precis(bol.fit, depth = 2)[,1]
text(threshold.est,
     3:1,
     round(threshold.est,2), pos = 3)
abline(v = thresholds,col = "blue", lty = 3)

As expected, we recover the correct threshold values -1.24, 0.25, 2 (with some error/bias due to simulation and prior).

Ordinal logistic regression with predictors

To understand how to set up a model such that a variable can increase or decrease the probability of higher response levels, lets look at the plot of latent variable and thresholds again. The plot also shows a hypothetical person with an average latent scholastic aptitude of 0 as a white dot.

set_par(cex = 1.5)
plot_dlogis.thresh(thresholds)
plotrix::draw.circle(0,.005,radius = .1,col = "white")

If we want to increase the chance that this hypothetical person achieves a rating of Y = 3, we can either increase the value of the latent variable or shift the the thresholds to the left:

set_par(mfrow = c(1,2),cex = 1.25)
plot_dlogis.thresh(thresholds)
arrows(x0 = 0, x1 = 1, y0 = .005, length = .1, col = "yellow")
plotrix::draw.circle(1,.005, radius = .1,col = "white")
plotrix::draw.circle(0,.005, radius = .1,col = adjustcolor("white",alpha = .75))


thresholds.b = thresholds - 1
plot_dlogis.thresh(thresholds.b)
abline(v = thresholds, col = adjustcolor("red",alpha = .25), lty = 2)
arrows(x0 = thresholds, x1 = thresholds.b,
       y0 = c(seq(.025,.075,length.out = 3)),
       col = "yellow", length = .1)
plotrix::draw.circle(0,.005,radius = .1,col = "white")

If we shift the latent value or the thresholds by the same amount, the probability of a higher response rises equally. In both plots above the bright white dot is in in the middle between the 2|3 and 3|4 thresholds.

Therefore we can, as we saw earlier for the logistic regression, add terms for individual level effects to the basic threshold / intercept only model to capture the effect of predictor variables on responses:

\[ log \left( \frac{P(Y_i>k)}{P(Y_i<k)} \right) = \alpha_k + \beta \cdot X_i \] where \(\small \alpha_k\) are thresholds and \(\beta\) captures the effect of the variable \(X\) by modifying each individual’s threshold.

Now we have seen previously that a threshold is just the log odds of responding in a category above vs below the threshold. Therefore, a positive (negative) regression weight in an ordinal logistic regression should be interpreted as the log odds ratio to give a one level higher (lower) response, i.e. Y = 4 instead of Y = 3 (Y = 2 instead of Y = 3).

Because there is only one regression weight \(\small \beta\) per variable that is added to all thresholds, the log odds ratios are the same for all category transitions, i.e. 

\[ \small log \left( \frac{P(Y_i>1|X_i = 0)}{P(Y_i<1|X_i = 1)} \right) = log \left( \frac{P(Y_i>2|X_i = 0)}{P(Y_i<2|X_i = 1)} \right) = log \left( \frac{P(Y_i>3|X_i = 0)}{P(Y_i<3|X_i = 1)} \right) \]

This is referred to as the proportional odds assumptions, which may or may not be true for the data at hand. There are alternative models that do not make this assumption. Here is a good paper that describes these models: Ordinal Regression Models in Psychology: A Tutorial

An example

To see the ordered logistic regression in action, we’ll use the example data from the Portuguese school1. In particular, we are looking at the final grade and estimate again the association with maternal education.

df=read.table("../Chapter11/data/student-mat.csv",sep=";",header=TRUE)
df = df[df$Medu>0,]
set_par()
df$G3 = as.numeric(cut(df$G3, breaks = seq(0,20,2),include.lowest = T))
df$G3 %>% table() %>% barplot()

Yes, cumulative ordinal logistic regressions can also handle multimodal data! But there are limits to this ability.

To recapitulate that thresholds are just log odds at category boundaries, lets calculate them:

# 1. Cumulative probability of all classes
cum_prob = df$G3 %>% table() %>% prop.table() %>% cumsum()
cum_prob = cum_prob[cum_prob!=1]
# calculate logg odds
simple_thresholds = log(cum_prob/(1-cum_prob))
simple_thresholds %>% round(2)
##     1     2     3     4     5     6     7     8     9 
## -2.23 -2.20 -1.69 -1.04 -0.11  0.71  1.51  2.73  4.16

And we estimate a thresholds only model with ulam:

tm.fit = ulam(
  alist(
    G3 ~ dordlogit(0,thresholds),
    thresholds ~ dnorm(0,3)),
  data=list(G3 = df$G3),
  chains=4,cores=4)

Now we can plot the manually calculated thresholds against those estimated with ulam:

post = extract.samples(tm.fit)
ulam_thresholds = colMeans(post$thresholds)
CI = apply(post$thresholds, 2, PI)
set_par(cex=1.5)
plot(simple_thresholds,
     ulam_thresholds,
     ylim = range(CI),
     pch = 16, col = "blue")
arrows(x0 = simple_thresholds,
       y0 = CI[1,], y1 = CI[2,],
       col = "blue", length = 0)
abline(0,1,lty = 3, col = "grey")

Now that we have understood thresholds, we can implement a model that uses predictors. Specifically, we estimate the effect of past failure and maternal education on grades and allow an interaction between these two variables. Ordinal logistic regressions allow differences in the magnitude of slopes between groups without modelling interactions explicitly. But we do need to model interactions explicitly if we want to allow slopes with different signs for different groups.

ol.model = 
  alist(
    G3 ~ dordlogit(phi,thresholds),
    phi <- bE*(Medu-2.5) + iFE*failures,
    iFE <- bF + bFE*(Medu-2.5), # interaction
    c(bF, bE, bFE) ~ normal(0,1),
    thresholds ~ dnorm(0,3)
  )

We fit the model.

ol.fit = ulam(
  ol.model,
  data=list(G3 = df$G3, Medu = df$Medu, failures = df$failures),
  chains=4,cores=4)

We first just check Rhat values to make sure the model converged:

precis(ol.fit, depth = 2) %>% round(3)
##                 mean    sd   5.5%  94.5%    n_eff Rhat4
## bFE           -0.192 0.116 -0.375 -0.006 2546.960 1.000
## bE             0.345 0.090  0.200  0.494 2356.090 0.998
## bF            -0.861 0.134 -1.080 -0.656 2307.201 1.000
## thresholds[1] -2.702 0.200 -3.023 -2.397 1481.249 1.001
## thresholds[2] -2.641 0.197 -2.956 -2.340 1541.319 1.000
## thresholds[3] -2.074 0.164 -2.336 -1.815 1913.871 1.000
## thresholds[4] -1.351 0.137 -1.573 -1.139 2438.878 0.999
## thresholds[5] -0.296 0.114 -0.479 -0.117 2392.613 0.999
## thresholds[6]  0.619 0.117  0.436  0.809 2363.338 1.001
## thresholds[7]  1.477 0.139  1.260  1.704 2410.789 0.999
## thresholds[8]  2.750 0.217  2.422  3.112 2325.046 0.998
## thresholds[9]  4.241 0.400  3.654  4.924 2385.215 1.001

This looks OK.

As a quick posterior predictive check we compare the histogram of the observed and predicted grades. We use the sim function instead of the link function to get posterior predictions on the scale of the ordered outcome variable.

set_par()

pp = sim(ol.fit)
obs.counts = cut(df$G3, breaks = seq(.5,10.5,1)) %>% table()
plot(0:10,c(0,obs.counts), 'S', xaxt = "n", ylim = c(0,110), ylab = "grade") 
axis(1,at = (1:10)-.5, labels = 1:10, lwd = 2)
for (k in 1:250) {
  pp.counts = cut(pp[k,], breaks = seq(.5,10.5,1)) %>% table()
  lines(0:10, c(0,pp.counts), 'S', col = adjustcolor("blue",alpha = .25))
}
lines(0:10,c(0,obs.counts), 'S', lwd = 2) 

We can see that the ordinal logistic model has no problems modeling this multimodal response distribution.

Lets quickly compare the thresholds with thresholds from a thresholds-only model again:

post = extract.samples(ol.fit)
ulam_thresholds = colMeans(post$thresholds)
CI = apply(post$thresholds, 2, PI)
set_par(cex=1.5)
plot(simple_thresholds,
     ulam_thresholds,
     ylim = range(CI),
     pch = 16, col = "blue")
arrows(x0 = simple_thresholds,
       y0 = CI[1,], y1 = CI[2,],
       col = "blue", length = 0)
abline(0,1,lty = 3, col = "grey")

Because the model now also estimates effects of education and past failures, the thresholds are set off a bit 2. Only if the covariates would be independent of grades would we expect to see identical thresholds.

Now lets look at the coefficients of interest:

coeffs = precis(ol.fit, pars = c("bE","bF","bFE"))
coeffs %>% plot()

coeffs %>% round(2)
##      mean   sd  5.5% 94.5%   n_eff Rhat4
## bE   0.35 0.09  0.20  0.49 2356.09     1
## bF  -0.86 0.13 -1.08 -0.66 2307.20     1
## bFE -0.19 0.12 -0.38 -0.01 2546.96     1

What do these results mean?3

  • for each level of maternal education the odds ratio to get a one level higher grade is exp(0.35) = 1.41
  • for each past fail the odds ratio to get a one level higher grade is exp(-0.86) = 0.42.
  • the “effect” of past fails depends on maternal education, with a log odd ratio of -0.19, though the 90% credible interval comes close to zero.

It is a bit hard to interpret these results because they are on the log odds ratio scale and because it is not totally clear (to me) how to interpret interaction effects. Therefor, we use posterior predictions to understand the results better.

First, we create posterior predictions for all levels of maternal education and past fails:

Lets look at the data for which we simulate outcomes:

# create all combinations of Medu and failures
sim.data = expand.grid(
  Medu = sort(unique(df$Medu)),
  failures = sort(unique(df$failures)))

# use the function sim to generate posterior predictions
pp = sim(ol.fit,data = sim.data, n = 2000)
sim.data
##    Medu failures
## 1     1        0
## 2     2        0
## 3     3        0
## 4     4        0
## 5     1        1
## 6     2        1
## 7     3        1
## 8     4        1
## 9     1        2
## 10    2        2
## 11    3        2
## 12    4        2
## 13    1        3
## 14    2        3
## 15    3        3
## 16    4        3

Here is a general overview over the model predictions:

Note the highly uncertainty in the posterior predictions!

mu = colMeans(pp)
CI80 = apply(pp,2,function(x) PI(x, prob = .8))
CI90 = apply(pp,2,function(x) PI(x, prob = .9))
set_par(cex = 1.5)
plot(0,type = "n", xlim = c(0.5,4.5),
     ylim = range(CI90), xaxt = "n",
     xlab = "previous fails",
     ylab = "grade")
axis(1,at = 1:4, labels = 0:3)

for (Medu in 1:4) {
  idx = sim.data$Medu == Medu
  x = (1:4)+(Medu-2.5)/6
  points(x,mu[idx], pch = 16, col = Medu)
  arrows(x0 = x, y0 = CI80[1,idx], y1 = CI80[2,idx], length = 0, col = Medu, lwd = 2)
  arrows(x0 = x, y0 = CI90[1,idx], y1 = CI90[2,idx], length = 0, col = Medu)
}
legend("topright", pch = 16, col = 1:4, legend = 1:4, title = "Medu", bty = "n")

This plot suggests several questions:

  • Do grades decrease from 0 to 3 previous fails?
  • Is higher maternal education associated with higher grades when there were 0 previous fails and with lower grades when there were 3 fails?
  • Are the above mentioned slopes robustly different?

Above, we used sim(ol.fit) to easily obtain predicted response values. However, these are a bit noisy, because thy are simulated randomly given the response probabilities phi and the thresholds. We can be more accurate by using only phi, and thresholds to calculate response-probabilities and then calculate expected responses by weighting the responses with these probabilities. This does not involve simulation of individual level responses and is therefore less noisy:

phi = link(ol.fit,data = sim.data)$phi
thresholds = extract_post_ulam(ol.fit)$thresholds
pp = matrix(NA, nrow = nrow(thresholds), ncol = nrow(sim.data))
for (k in 1:nrow(sim.data)) {
  log_odds_ratio = apply(thresholds,2, function(x) x - phi[,k])
  cum_prob = apply(log_odds_ratio,1, function(x) plogis(x)) %>% t()
  cum_prob = cbind(cum_prob,1)
  res_prob = cum_prob - cbind(0,cum_prob[,1:(ncol(cum_prob)-1)])
  pp[,k] = apply(res_prob,1, function(x) sum(x*(1:10))) # expected grade
}
To answer the first question, “Do grades decrease from 0 through 3 previous fails?”, we calculate for 1,2,3 previous fails if one less previous fail had lead to a better grade and average over these 3 comparisons:

The minimum number of previous fails is 0.

\[ \delta = \frac{1}{3}\sum^i_{1:3} G3(fails = i) - G3(fails = i-1) \] where \(\small G3(fails = i)\) is the expected grade give i previous fails.

posterior_hist = function(x,label, set_par = TRUE, xlim = NULL) {
  if (set_par == T) set_par(cex = 1.5, mar=c(3,3,3,.5))
  tlt = paste0(round(mean(x),2),", (",paste(round(PI(x),2), collapse = ", "),
               ")\n P(",label,">0) = ", round(mean(x>0),3))
  hist(x, ylab = label, xlim = xlim, xlab = label,
       main = tlt, cex.main = 1)
  abline(v = 0, col = "red", lty = 3)
  abline(v = (PI(x)), col = "blue", lwd = 2)
}
delta = rep(0,nrow(pp))
# pp is a matrix with posterior predictions
# for the data in sim.data
for (i in 1:3) { # i is number of previous fails
  delta = delta +
    pp[,which(sim.data$failures == i)] - 
    pp[,which(sim.data$failures == i-1)]
}
delta = delta / 3
posterior_hist( # un-hide previous code block for this function
  delta,
  "delta fails",
  xlim = range(delta))

On average grades decreased by -0.96 for each previous fail.

To answer the second question, “Is higher maternal education associated with higher grades when there were 0 previous fails and with lower grades when there were 3 fails?”, we first calculate the effect of maternal education for students with 0 and 3 previous fails:

# calculate contrasts
delta_F0 = rep(0,nrow(pp))
delta_F3 = rep(0,nrow(pp))
# we loop over maternal education levels 2-4
# each time calculating the difference to the next lower level
for (i in 2:4) { # here, i is the maternal education level
  delta_F0 = delta_F0 +
    pp[,which(sim.data$Medu == i & sim.data$failures == 0)] - 
    pp[,which(sim.data$Medu == i-1 & sim.data$failures == 0)]
  delta_F3 = delta_F3 +
    pp[,which(sim.data$Medu == i & sim.data$failures == 3)] - 
    pp[,which(sim.data$Medu == i-1 & sim.data$failures == 3)]
}
# average effect of increasing Medu by 1 at 0 previous fails
delta_F0 = delta_F0 / 3 
# average effect of increasing Medu by 1 at 3 previous fails
delta_F3 = delta_F3 / 3

We see that whereas higher maternal education is associated with better grades when the student had not failed the class yet, there is very weak evidence for the opposite effect for students who failed the class 3 times already. To see if the two effects are really different, we need to compare them explicitly, which we do next:

We calculate the difference between these two deltas:

interaction_eff = delta_F0-delta_F3
posterior_hist(
  interaction_eff,
  "delta(fails=0)-delta(fails=3)",
  xlim = range(interaction_eff))

We are reasonably certain that there is a difference in slopes, but we can’t be sure.

What about ordered probit?

The ordered probit uses the standard normal distribution for the latent variable. The only disadvantage I see with the ordered probit is that the coefficients cannot be interpreted as (log) odds ratios.

However, one can then interpret regression coefficients as changes on the level of the latent variable on the scale of a standard normal distribution. This can be more attractive, if one is primarily interested in the latent variable. If one is primarily interested in the relative frequency of the response categories the ordered logistic model is easier to interpret.

Ordered predictors

The basic idea for an ordered predictor with \(\small M\) levels is as follows:

  1. Set the value for the standard effect to 0 for the lowest level (\(\small \lambda_1 = 0\)) and to 1 for the highest level (\(\small \lambda_k = 1\)).
  2. Estimate parameters \(\small \lambda_2 \textrm{ to } \lambda_{k-1}\) that describe how much of the maximal effect if realized at levels \(\small 2 \textrm { to } k-1\), such that \(\small \lambda_1 \textrm{ to } \lambda_{k}\) increase monotonically.
  3. Estimate a parameter \(\small \beta_o\) that scales the \(\lambda\)s such that the effect of a given level is calculated as \(\small \lambda_k\beta_o\). Than the effect at the higest level of the ordered predictor will be \(\small \beta_o\).

This approach to ordered predictors can be used for any kind of outcome variable.

Over- and underdispersion

Summary


  1. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. The data can be downloaded here↩︎

  2. Not centering predictor variables leads to additional differences↩︎

  3. remember that because thresholds are changed as \(\small\alpha - \beta X\) positive values of \(\small \beta\) mean that higher positive values x move the thresholds to the left and thus make higher-level responses more likely.↩︎